
What is a distributed database, and how does it work?

Last edited on October 9, 2023


    The cloud is, by definition, a distributed system. Distributed application architecture is mainstream in the world of software — you wouldn’t build any other way. For more than a decade, though, the database has lagged behind. The architecture of traditional relational databases frequently conflicts with the architectural needs of modern cloud applications: horizontal scale, elasticity, and microservices. Distributed databases arose as the answer to this conflict, and they are becoming mainstream, too. So what is a distributed database, and when should you use one?

    What is a distributed database?

    A distributed database is a database that runs and stores data across multiple computers, as opposed to doing everything on a single machine.

    Typically, distributed databases operate on two or more interconnected servers on a computer network. Each location where a version of the database is running is often called an instance or a node.

    A distributed database, for example, might have instances running in New York, Ohio, and California. Or it might have instances running on three separate machines in New York. A traditional single-instance database, in contrast, only runs in a single location on a single machine.

    What is a distributed database used for?

    There are different types of distributed databases and different distributed database configuration options, but in general distributed databases offer several advantages over traditional, single-instance databases:

    Distributing the database increases resilience and reduces risk. If a single-instance database goes offline (due to a power outage, machine failure, scheduled maintenance, or anything else) all of the application services that rely on it will go offline, too. Distributed databases, in contrast, are typically configured with replicas of the same data across multiple instances, so if one instance goes offline, other instances can pick up the slack, allowing the application to continue operating.

    Different distributed database types and configurations handle outages differently, but in general almost any distributed database should be able to handle outages better than a single-instance database.

    For this reason, distributed databases are an increasingly popular choice, particularly for mission-critical workloads and any data that needs to remain available at all times.

    Distributed databases are generally easier to scale. As an application grows to serve more users, the storage and computing requirements for the database will increase over time — and not always at a predictable rate.

    Trying to keep up with this growth when using a single-instance database is difficult – you either have to pay for more than you need so that your database has “room to grow” in terms of storage and computing power, or you have to navigate regular hardware upgrades and migrations to ensure the database instance is always running on a machine that’s capable of handling the current load.

    Distributed databases, in contrast, can scale horizontally simply by adding an additional instance or node. In some cases, this process is manual (although it can be scripted), and in the case of serverless databases it is entirely automated. In almost all cases, the process of scaling a distributed database up and down is more straightforward than trying to do the same with a single-instance database.

    Distributing the database can improve performance. Depending on how it is configured, a distributed database may be able to operate more efficiently than a single-instance database because it can spread the computing workload between multiple instances rather than being bottlenecked by having to perform all reads and writes on the same machine.

    Geographically distributing the database can reduce latency. Although not all distributed databases support multi-region deployments, those that do can also improve application performance for users by reducing latency. When data can be located on a database instance that is geographically close to the user who is requesting it, that user will likely have a lower-latency application experience than a user whose application needs to pull data from a database instance that’s (for example) on the other side of the globe.

    Depending on the specific type, configuration, and deployment choices an organization makes, there may be additional benefits to using a distributed database. Let’s look at some of the options that are available when it comes to distributed databases.

    Types of distributed databases: NoSQL vs. distributed SQL databases

    Broadly, there are two types of distributed databases: NoSQL and distributed SQL. (Document-based and key-value are two other terms often used to describe NoSQL databases, so you may sometimes see these options compared as "document-based vs. relational," for example.)

    To understand the difference between them, it’s helpful to take a quick dive into the history of databases.

    Humans have been storing data in various formats for millennia, of course, but the modern era of computerized databases really began with Edgar F. Codd and the invention of the relational (SQL) database. Relational databases store data in tables and enforce rules – collectively called a schema – about what types of data can be stored where and how the data relate to each other.

    Relational databases and SQL, the language used to define and query them, caught on in the 1970s and quickly became the default database type for virtually all computerized data storage. Transactional applications, in particular, came to rely on relational databases for their ability to support ACID transactional guarantees – in essence, to ensure that transactions are processed correctly, can’t interfere with each other, and remain true once they’re committed even if the database subsequently goes offline.
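
    To make the ACID idea concrete, here is a minimal sketch of a transactional transfer in Python, assuming the psycopg2 driver and a Postgres-compatible database; the connection string, table, and account IDs are hypothetical placeholders, not examples from this article.

```python
# A minimal sketch of an ACID transaction, using the psycopg2 driver against
# a Postgres-compatible database. The connection string, table, and account
# IDs are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect("postgresql://user:password@localhost:5432/bank")

try:
    with conn:  # commits on success, rolls back if anything raises
        with conn.cursor() as cur:
            # Atomicity + isolation: both updates happen together or not at
            # all, and no other transaction sees the transfer half-finished.
            cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
            cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (2,))
    # Durability: once the block exits cleanly, the committed transfer
    # survives even if the database restarts immediately afterwards.
except psycopg2.Error as exc:
    # The whole transaction was rolled back, so the database never shows a
    # half-completed transfer.
    print(f"transfer aborted: {exc}")
finally:
    conn.close()
```

    Because the two updates run inside one transaction, the database never exposes a state where money has left one account but not arrived in the other.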

    After the explosion of the internet, though, it became clear that there were limitations to the traditional relational database. In particular, it wasn’t easy to scale, it wasn’t built to function well in cloud environments, and distributing it across multiple instances required complex, manual work called sharding.

    In part as a response to this, a new class of databases called NoSQL databases arose. These databases were built to be cloud-native, resilient, and horizontally scalable. But to accomplish those goals, they sacrificed the strict schema enforcement and ACID guarantees offered by traditional relational databases, storing data in a less structured format. At scale, NoSQL databases have appealing advantages over traditional relational databases, but particularly for transactional workloads, they also require making compromises when it comes to data consistency and correctness.
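
    As a rough illustration of that structural difference (the record shapes below are invented for this sketch), a document-style record can hold whatever fields it likes, while a relational row has to match a declared schema:

```python
# Purely illustrative record shapes, not tied to any particular database.
# A document-style (NoSQL) record is free-form: fields can vary from one
# record to the next, and nested values are fine.
user_document = {
    "name": "Alice",
    "email": "alice@example.com",
    "favorites": ["books", "hiking"],
    # nothing forces a "signup_date" to exist or "email" to be unique
}

# A relational row must match a declared schema: fixed columns, fixed types,
# and constraints (NOT NULL, UNIQUE, foreign keys) enforced by the database
# before a write is accepted.
user_row = ("Alice", "alice@example.com", "2023-10-09")  # (name, email, signup_date)
```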

    In recent years, a new class of relational database – “new SQL”, a.k.a. the distributed SQL database – has emerged, aiming to offer a best-of-both-worlds option. Distributed SQL provides the cloud-native scaling and resilience of NoSQL databases and the strict schema, consistency, and ACID guarantees of traditional relational databases.

    Unlike traditional relational databases, distributed SQL databases don’t require manual work to distribute and scale. But they can still offer ACID guarantees, making them a highly appealing prospect for any organization with important transactional workloads.

    Today, both NoSQL and distributed SQL databases are widely used, and many organizations use both types. Broadly speaking, NoSQL databases are common choices for analytics and big data workloads, while distributed SQL databases are used for transactional workloads and other applications such as system-of-record stores where data consistency can’t be sacrificed for availability and scale. For this reason, a distributed SQL database may sometimes be called a distributed transactional database.

    Distributed database configurations: active-passive vs. active-active vs. multi-active

    One of the main goals of a distributed database is high availability: making sure the database and all of the data it contains are available at all times. But when a database is distributed, its data is replicated across multiple physical instances, and there are several different ways to approach configuring those replicas.

    Active-passive

    The first, and simplest, is an active-passive configuration. In an active-passive configuration, all traffic is routed to a single “active” replica, and then copied to the other replicas for backup.

    In a three-node deployment, for example, all data might be written to an active replica on node 1 and then subsequently copied to passive replicas on nodes 2 and 3.

    This approach is straightforward, but it does introduce potential problems. In addition to the performance bottleneck that routing all reads and writes to a specific replica can present, problems can also arise depending on how new data is written to the passive “follower” replicas:

    • If the data is replicated synchronously (immediately) and a write to one of the “follower” replicas fails, then you must sacrifice either availability (the database becomes unavailable unless all three replicas are online) or consistency (the replicas may end up holding conflicting data, since an update can be written to the active replica but fail to reach one of the passive followers).

    • If the data is replicated asynchronously, there’s no way to guarantee that data makes it to the passive follower replicas (a follower could be online when the data is written to the active replica but go offline before the data is subsequently replicated to it). This introduces the possibility of inconsistencies and even data loss.

    In summary, active-passive systems offer one of the most straightforward configuration options – particularly if you’re trying to manually adapt a traditional relational database for a distributed deployment. But they also introduce risks and trade-offs that can impact database availability and consistency.
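
    To make the synchronous trade-off concrete, here is a toy Python simulation; the Replica class and node names are invented for illustration and do not model any real database’s replication code.

```python
# A toy simulation of synchronous active-passive replication; the Replica
# class and node names are invented for illustration.
class Replica:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.data = {}

    def write(self, key, value):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        self.data[key] = value


def synchronous_write(active, followers, key, value):
    """Commit only if the active replica and every follower accept the write."""
    active.write(key, value)
    for follower in followers:
        follower.write(key, value)  # raises if a follower is offline
    return "committed"


active = Replica("node1")
followers = [Replica("node2"), Replica("node3")]

print(synchronous_write(active, followers, "xyz", 123))  # committed

followers[1].online = False  # node3 goes offline
try:
    synchronous_write(active, followers, "xyz", 456)
except ConnectionError as exc:
    # node1 and node2 now hold xyz = 456 while node3 still holds 123: the
    # system must either roll the write back (sacrificing availability) or
    # tolerate the divergence (sacrificing consistency).
    print(f"replication failed: {exc}")
```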

    Active-active

    In active-active configurations, there are multiple active replicas, and traffic is routed to all of them. This reduces the potential impact of a replica being offline, since other replicas will handle the traffic automatically.

    However, active-active setups are much more difficult to configure for most workloads, and it is still possible for consistency issues to arise if an outage happens at the wrong time.

    For example, imagine an active-active system with replicas A and B:

    1. A receives a write for key xyz with the value 123, then immediately fails and goes offline.

    2. A subsequent read for xyz is therefore routed to B, which returns NULL, because xyz = 123 hadn’t yet been copied to B when A went offline.

    3. The application, seeing that there is no current value for xyz, sends an xyz = 456 write to B.

    4. A comes back online.

    At the end of this sequence, we have an inconsistency: A says xyz = 123 and B says xyz = 456. While such a scenario is not common, inconsistencies like this one have the potential to cause a lot of trouble when they do happen, so active-active setups must be configured and tested very carefully to attempt to mitigate this risk.
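
    Replaying that sequence in a few lines of illustrative Python (nothing here models a real database) makes the divergence easy to see:

```python
# A replay of the sequence above in plain Python; it just makes the ordering
# of events concrete.
replica_a = {}
replica_b = {}

# 1. A receives the write, then fails before replicating it to B.
replica_a["xyz"] = 123
a_online = False

# 2. The read is routed to B, which has never seen the key.
value = replica_b.get("xyz")        # -> None (i.e. NULL)

# 3. Seeing no current value, the application writes a new one to B.
if value is None:
    replica_b["xyz"] = 456

# 4. A comes back online, still holding its original write.
a_online = True

print(replica_a["xyz"], replica_b["xyz"])   # 123 456 -- the replicas disagree
```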

    Multi-active

    Multi-active availability is the approach CockroachDB uses, and it attempts to offer a better alternative to both active-passive and active-active configurations.

    Like active-active configurations, all replicas can handle both reads and writes in a multi-active system. But unlike active-active, multi-active systems eliminate the possibility of inconsistencies by using a consensus replication system, where writes are only committed when a majority of replicas confirm they’ve received the write.



    A majority of replicas thus defines what is correct, allowing the database to remain both online and consistent even if some replicas are offline when a write occurs. If a majority of replicas are offline, the entire database becomes unavailable to prevent the introduction of inconsistent data.
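
    Here is a highly simplified sketch of majority-quorum replication in Python; real consensus protocols such as Raft also involve leaders, terms, and replicated logs, which this toy version ignores, and the replica dictionaries are invented for illustration.

```python
# A highly simplified sketch of majority-quorum replication. Real consensus
# protocols (Raft, Paxos) add leaders, terms, and logs that this ignores.
def quorum_write(replicas, key, value):
    """Commit a write only if a majority of replicas can acknowledge it."""
    online = [r for r in replicas if r["online"]]
    if len(online) <= len(replicas) // 2:
        return "rejected"             # no majority reachable: refuse the write
    for replica in online:
        replica["data"][key] = value  # the reachable majority applies the write
    return "committed"


replicas = [{"online": True, "data": {}} for _ in range(3)]

print(quorum_write(replicas, "xyz", 123))   # committed (3 of 3 replicas ack)

replicas[2]["online"] = False
print(quorum_write(replicas, "xyz", 456))   # committed (2 of 3 is still a majority)

replicas[1]["online"] = False
print(quorum_write(replicas, "xyz", 789))   # rejected (1 of 3 is not a majority)
```

    Note that with three replicas the cluster tolerates one failure; with five it tolerates two, and so on.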

    Distributed databases vs. cloud databases

    Since we’re discussing configuration options for distributed databases, it’s worth pointing out that although the terms distributed database and cloud database are sometimes used interchangeably, they’re not necessarily the same thing.

    A distributed database is any database that’s distributed across multiple instances. Often, these instances are deployed to a public cloud provider such as AWS, GCP, or Azure, but they don’t have to be. Distributed databases can also be deployed on-premises, and some even support hybrid cloud and multi-cloud deployments.

    A cloud database is any database that’s been deployed in the cloud (generally a public cloud such as AWS, GCP, or Azure), whether it’s a traditional single-instance deployment or a distributed deployment.

    In other words, a distributed database might be run in the cloud, but it doesn’t have to be. Similarly, a cloud database might be distributed, but it doesn’t have to be.

    Pros and cons of distributed databases

    We’ve already discussed the pros of distributed databases earlier in this article, but to quickly review, the reasons to use a distributed database are generally:

    • High availability (data is replicated so that the database remains online even if a machine goes down)

    • High scalability (the database can be scaled horizontally simply by adding instances/nodes)

    • Improved performance (depending on type, configuration, and workload)

    • Reduced latency (for distributed databases that support multi-region deployments)

    Beyond those, specific distributed databases may offer additional appealing features. CockroachDB, for example, allows applications to treat the database as though it were a single-instance deployment, making it simpler to work with from a developer perspective. It also offers CDC changefeeds to facilitate its use within event-driven applications.
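
    As a hedged example of the changefeed idea, an application could issue something like the following through an ordinary Postgres driver; the connection string, table name, and Kafka address are placeholders, and the full CREATE CHANGEFEED syntax and options are described in the CockroachDB documentation.

```python
# A hedged sketch of starting a CDC changefeed from application code via the
# psycopg2 driver; the connection string, table name, and Kafka address are
# placeholders.
import psycopg2

conn = psycopg2.connect("postgresql://user:password@localhost:26257/defaultdb")
conn.autocommit = True

with conn.cursor() as cur:
    # Stream row-level changes from the orders table into a Kafka topic so
    # downstream services can consume them as events.
    cur.execute("CREATE CHANGEFEED FOR TABLE orders INTO 'kafka://kafka-broker:9092'")

conn.close()
```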

    The cons of distributed databases also vary based on the specifics of the database’s type, configuration, and the workloads it’ll be handling. In general, though, potential downsides to a distributed database may include:

    • Increased operational complexity. Deploying, configuring, managing, and optimizing a distributed database can be more complex than working with a single-instance DB. However, many distributed databases offer managed DBaaS deployment options that can deal with the operational work for you.

    • Increased learning curve. Distributed databases work differently, and it typically takes teams some time to adapt to a new set of best practices. In the case of NoSQL databases, there may also be a learning curve for developers, as some popular NoSQL databases use proprietary query languages. (Distributed SQL databases, on the other hand, use a language most developers already know: SQL.)

    Beyond these, though, there are a variety of additional considerations that must be assessed on a case-by-case basis.

    Cost, for example, is a significant factor for most organizations, but it’s not possible to say categorically that a distributed database is cheaper or more expensive than a single-instance one – it depends on the database you pick, how you choose to deploy it, the workload requirements, how it’s configured, etc.

    In principle, a distributed database might sound more expensive, as it runs on multiple instances rather than a single one. In practice, though, they can often be cheaper – especially when you factor in the cost of your database becoming unavailable. For large companies dealing with thousands of transactions per minute, even a few minutes of downtime can result in losses in the millions of dollars.

    Similarly, managed DBaaS deployment options can look more expensive than self-hosted options at first, but they also significantly reduce the operational workload that has to be carried by your own team, which can make them the cheaper option.

    For this reason, companies typically spend significant time and money testing and evaluating their database options to determine which one best fits their specific budget and requirements.

    How a distributed database works

    Distributed databases are quite complicated, and entire books could be written about how they work. That level of detail is outside the scope of this article, but we will take a look at how one distributed SQL database, CockroachDB, works at a high level.

    From the perspective of your application, CockroachDB works very similarly to a single Postgres instance – you connect and send data to it in precisely the same way. But when the data reaches the database, CockroachDB automatically replicates and distributes it across three or more nodes (individual instances of CockroachDB).
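
    For example, a Python application might connect with the standard psycopg2 Postgres driver; the connection details below are placeholders. Nothing in the application code needs to know how many nodes sit behind that address.

```python
# A minimal sketch of the point above: the application talks to CockroachDB
# through an ordinary Postgres driver (psycopg2 here); the connection details
# are placeholders.
import psycopg2

conn = psycopg2.connect(
    "postgresql://user:password@localhost:26257/defaultdb?sslmode=disable"
)

with conn.cursor() as cur:
    cur.execute("SELECT version()")   # same wire protocol, same SQL dialect
    print(cur.fetchone()[0])

conn.close()
```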

    To understand how this occurs, let’s focus on what happens to a single range – a chunk of data – when it’s written to the database. For the purposes of simplicity, we’ll use the example of a three-node, single-region cluster, although CockroachDB can support multi-region deployments and many, many more nodes.

    In our example, when the data in a range is sent to the database, it is written into three replicas – three copies of the data, one on each node. One of the three nodes is automatically designated the leaseholder for this range, meaning that it coordinates read and write requests relating to the data in that range. But any node can receive requests, distinguishing CockroachDB from active-passive systems in which requests must pass through the central “active” node.

    Consistency between the replicas on each node is maintained using the Raft consensus algorithm, which ensures that a majority of replicas agree on the correctness of data being entered before a write is committed. This is how CockroachDB achieves its multi-active designation – like an active-active system, all nodes can receive read and write requests, but unlike an active-active system, there is no risk of consistency problems arising.

    Of course, in practice it’s all a bit more complex than that makes it sound! For a full accounting of how CockroachDB works, this architecture guide is a good starting point.
